6. Crawler and Typeahead

2021-03-06

Design a web crawler
Design thread-safe producer and consumer
Design a Typeahead

Web Crawler

Scenario

How many webpages? How long? How large?

Service

Crawler, TaskService, StorageService

Storage

Use db to store tasks, BigTable to store web pages

Scale

简单 crawler

Crawler

单线程 crawler

Crawler

Crawler

distributed crawler
- 每个机器抓取里自己最近的地方的网站
- 使用数据库存储每个网页的抓取状况，解决了在 queue 中 consumer 和 producer 速度不同的问题

Crawler

Typeahead

Scenario

typeahead

Services

QueryService
DataCollectionService

typeahead

Storage

提前存好 key-value 在表中，这样可以快速查询到
在 query 中使用 LIKE 很费时，所以直接存取 key-value 在表中
在用户的每次查询中获得 raw data, 然后存到表中

typeahead

更好的方法是用 trie

typeahead

Scale

降低 response time:
- frontend: cache result
- pre-fetch: 预加载数据, 预测用户会输入什么, 提前拿数据